Stem Tokens using ExampleSet (Operator Toolbox)

Synopsis

Replaces terms using pattern-matching rules. This operator uses an ExampleSet to stem a list of words inside a ''Process Documents'' operator.

Description

This operator can be used inside your ''Process Documents'' operator and lets you provide a custom list of stemming rules. It works like the Stem (Dictionary) operator, except that the input here is an ExampleSet rather than a file.

It reduces terms to a base form using replacement rules from an external ExampleSet. The ExampleSet must contain one rule per line, in the form:

    targetExpression:pattern1 pattern2 ...

where targetExpression is the term to which an input term is reduced if that term matches any of the patterns. Each patternX is a simple string or a regular expression. A simple example is a mapping like:

    weekday : .*day

Keep in mind that very short words are filtered out in the default setting of the TextInput operators.
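The rule format can be illustrated with a minimal Python sketch. This is not the operator's implementation. The function names parse_rule and stem_tokens are hypothetical, and the sketch assumes that a pattern must match the whole token and that the first matching rule wins.

```python
import re


def parse_rule(line):
    """Split 'target:pattern1 pattern2 ...' into (target, [compiled patterns])."""
    target, _, patterns = line.partition(":")
    return target.strip(), [re.compile(p) for p in patterns.split()]


def stem_tokens(tokens, rule_lines):
    """Replace each token by the target of the first rule it matches."""
    rules = [parse_rule(line) for line in rule_lines]
    stemmed = []
    for token in tokens:
        for target, patterns in rules:
            # Assumption: a pattern has to match the entire token.
            if any(p.fullmatch(token) for p in patterns):
                token = target
                break
        stemmed.append(token)
    return stemmed


tokens = ["see", "you", "on", "monday", "or", "friday"]
print(stem_tokens(tokens, ["weekday : .*day"]))
# ['see', 'you', 'on', 'weekday', 'or', 'weekday']
```

Under these assumptions, a broad pattern such as .*day also rewrites tokens like ''today'', so patterns may need to be more specific in practice.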

Input

  • doc

    The documents input port.

  • exa (Data Table)

    The ExampleSet with the tokens.

Output

  • doc

    The resulting document.

Parameters

  • attribute

    The name of the attribute that should be used for stemming. Range:

Tutorial Processes

Stem weekdays from a document

In this example, we replace the names of weekdays with the word ''weekday''.